{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "# Tranlation Matrix Tutorial"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "## What is it ?"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "Suppose we are given a set of word pairs and their associated vector representaion $\\{x_{i},z_{i}\\}_{i=1}^{n}$, where $x_{i} \\in R^{d_{1}}$ is the distibuted representation of word $i$ in the source language, and ${z_{i} \\in R^{d_{2}}}$ is the vector representation of its translation. Our goal is to find a transformation matrix $W$ such that $Wx_{i}$ approximates $z_{i}$. In practice, $W$ can be learned by the following optimization prolem:\n",
    "\n",
    "<center>$\\min \\limits_{W} \\sum \\limits_{i=1}^{n} ||Wx_{i}-z_{i}||^{2}$</center>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "## Resources"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "Tomas Mikolov, Quoc V Le, Ilya Sutskever. 2013.[Exploiting Similarities among Languages for Machine Translation](https://arxiv.org/pdf/1309.4168.pdf)\n",
    "\n",
    "Georgiana Dinu, Angelikie Lazaridou and Marco Baroni. 2014.[Improving zero-shot learning by mitigating the hubness problem](https://arxiv.org/pdf/1309.4168.pdf)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true
   },
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Using Theano backend.\n"
     ]
    }
   ],
   "source": [
    "import os\n",
    "\n",
    "from gensim import utils\n",
    "from gensim.models import translation_matrix\n",
    "from gensim.models import KeyedVectors"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "For this tutorial, we'll train our model using the English -> Italian word pairs from the OPUS collection. This corpus contains 5000 word pairs. Each word pair is English word with corresponding Italian word.\n",
    "\n",
    "Dataset download: \n",
    "\n",
    "[OPUS_en_it_europarl_train_5K.txt](https://pan.baidu.com/s/1nuIuQoT)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[(u'for', u'per'), (u'that', u'che'), (u'with', u'con'), (u'are', u'are'), (u'are', u'sono'), (u'this', u'questa'), (u'this', u'questo'), (u'you', u'lei'), (u'not', u'non'), (u'which', u'che')]\n"
     ]
    }
   ],
   "source": [
    "train_file = \"OPUS_en_it_europarl_train_5K.txt\"\n",
    "\n",
    "with utils.smart_open(train_file, \"r\") as f:\n",
    "    word_pair = [tuple(utils.to_unicode(line).strip().split()) for line in f]\n",
    "print (word_pair[:10])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "This tutorial uses 300-dimensional vectors of English words as source and vectors of Italian words as target. (Those vector trained by the word2vec toolkit with cbow. The context window was set 5 words to either side of the target,\n",
    "the sub-sampling option was set to 1e-05 and estimate the probability of a target word with the negative sampling method, drawing 10 samples from the noise distribution)\n",
    "\n",
    "Download dataset:\n",
    "\n",
    "[EN.200K.cbow1_wind5_hs0_neg10_size300_smpl1e-05.txt](https://pan.baidu.com/s/1nv3bYel)\n",
    "\n",
    "[IT.200K.cbow1_wind5_hs0_neg10_size300_smpl1e-05.txt](https://pan.baidu.com/s/1boP0P7D)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true
   },
   "outputs": [],
   "source": [
    "# Load the source language word vector\n",
    "source_word_vec_file = \"EN.200K.cbow1_wind5_hs0_neg10_size300_smpl1e-05.txt\"\n",
    "source_word_vec = KeyedVectors.load_word2vec_format(source_word_vec_file, binary=False)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "collapsed": true,
    "deletable": true,
    "editable": true
   },
   "outputs": [],
   "source": [
    "# Load the target language word vector\n",
    "target_word_vec_file = \"IT.200K.cbow1_wind5_hs0_neg10_size300_smpl1e-05.txt\"\n",
    "target_word_vec = KeyedVectors.load_word2vec_format(target_word_vec_file, binary=False)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "Train the translation matrix"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "('the shape of translation matrix is: ', (300, 300))\n"
     ]
    }
   ],
   "source": [
    "transmat = translation_matrix.TranslationMatrix(source_word_vec, target_word_vec, word_pair)\n",
    "transmat.train(word_pair)\n",
    "print (\"the shape of translation matrix is: \", transmat.translation_matrix.shape)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "Prediction Time: For any given new word, we can map it to the other language space by coputing $z = Wx$, then we find the word whose representation is closet to z in the target language space, using consine similarity as the distance metric."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "#### Part one:\n",
    "Let's look at some vocabulary of numbers translation. We use English words (one, two, three, four and five) as test."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true
   },
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/home/robotcator/PycharmProjects/gensim/gensim/models/translation_matrix.py:223: UserWarning: The parameter source_lang_vec isn't specified, use the model's source language word vector as default.\n",
      "  warnings.warn(\"The parameter source_lang_vec isn't specified, use the model's source language word vector as default.\")\n",
      "/home/robotcator/PycharmProjects/gensim/gensim/models/translation_matrix.py:227: UserWarning: The parameter target_lang_vec isn't specified, use the model's target language word vector as default.\n",
      "  warnings.warn(\"The parameter target_lang_vec isn't specified, use the model's target language word vector as default.\")\n"
     ]
    }
   ],
   "source": [
    "# The pair is in the form of (English, Italian), we can see whether the translated word is correct\n",
    "words = [(\"one\", \"uno\"), (\"two\", \"due\"), (\"three\", \"tre\"), (\"four\", \"quattro\"), (\"five\", \"cinque\")]\n",
    "source_word, target_word = zip(*words)\n",
    "translated_word = transmat.translate(source_word, 5, )"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true,
    "scrolled": true
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "('word ', 'one', ' and translated word', [u'solo', u'due', u'tre', u'cinque', u'quattro'])\n",
      "('word ', 'two', ' and translated word', [u'due', u'tre', u'quattro', u'cinque', u'otto'])\n",
      "('word ', 'three', ' and translated word', [u'tre', u'quattro', u'due', u'cinque', u'sette'])\n",
      "('word ', 'four', ' and translated word', [u'tre', u'quattro', u'cinque', u'due', u'sette'])\n",
      "('word ', 'five', ' and translated word', [u'cinque', u'tre', u'quattro', u'otto', u'dieci'])\n"
     ]
    }
   ],
   "source": [
    "for k, v in translated_word.iteritems():\n",
    "    print (\"word \", k, \" and translated word\", v)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "#### Part two:\n",
    "Let's look at some vocabulary of fruits translation. We use English words (apple, orange, grape, banana and mango) as test."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "word  apple  and translated word [u'apple', u'mela', u'microsoft', u'macintosh', u'turbolinux']\n",
      "word  orange  and translated word [u'arancione', u'curacao', u'aranciato', u'gialloarancio', u'bluastro']\n",
      "word  grape  and translated word [u'sylvaner', u'vinsanto', u'marzemino', u'traminer', u'legume']\n",
      "word  banana  and translated word [u'anacardi', u'papaia', u'muesli', u'manioca', u'basmati']\n",
      "word  mango  and translated word [u'patchouli', u'anacardi', u'papaia', u'tamarindo', u'guacamole']\n"
     ]
    }
   ],
   "source": [
    "words = [(\"apple\", \"mela\"), (\"orange\", \"arancione\"), (\"grape\", \"acino\"), (\"banana\", \"banana\"), (\"mango\", \"mango\")]\n",
    "source_word, target_word = zip(*words)\n",
    "translated_word = transmat.translate(source_word, 5)\n",
    "for k, v in translated_word.iteritems():\n",
    "    print (\"word \", k, \" and translated word\", v)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "#### Part three:\n",
    "Let's look at some vocabulary of animals translation. We use English words (dog, pig, cat, horse and bird) as test."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "word  dog  and translated word [u'cane', u'cani', u'cagnolino', u'barboncino', u'micio']\n",
      "word  pig  and translated word [u'maiali', u'maialini', u'animale', u'barboncino', u'vitella']\n",
      "word  cat  and translated word [u'gatto', u'gattino', u'micio', u'cagnolino', u'cane']\n",
      "word  fish  and translated word [u'krill', u'pesce', u'pesci', u'storioni', u'alborelle']\n",
      "word  birds  and translated word [u'uccelli', u'passeriformi', u'antilopi', u'rettili', u'svassi']\n"
     ]
    }
   ],
   "source": [
    "words = [(\"dog\", \"cane\"), (\"pig\", \"maiale\"), (\"cat\", \"gatto\"), (\"fish\", \"cavallo\"), (\"birds\", \"uccelli\")]\n",
    "source_word, target_word = zip(*words)\n",
    "translated_word = transmat.translate(source_word, 5)\n",
    "for k, v in translated_word.iteritems():\n",
    "    print (\"word \", k, \" and translated word\", v)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "### The Creation Time for the Translation Matrix"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "Testing the creation time, we extracted more word pairs from a dictionary built from Europarl([Europara, en-it](http://opus.lingfil.uu.se/)). We obtain about 20K word pairs and their coresponding word vectors or you can download from this.[word_dict.pkl](https://pan.baidu.com/s/1dF8HUX7)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "the length of word pair  20942\n"
     ]
    }
   ],
   "source": [
    "import pickle\n",
    "word_dict = \"word_dict.pkl\"\n",
    "with utils.smart_open(word_dict, \"r\") as f:\n",
    "    word_pair = pickle.load(f)\n",
    "print (\"the length of word pair \", len(word_pair))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true
   },
   "outputs": [],
   "source": [
    "import time\n",
    "\n",
    "test_case = 10\n",
    "word_pair_length = len(word_pair)\n",
    "step = word_pair_length / test_case\n",
    "\n",
    "duration = []\n",
    "sizeofword = []\n",
    "\n",
    "for idx in xrange(0, test_case):\n",
    "    sub_pair = word_pair[: (idx + 1) * step]\n",
    "\n",
    "    startTime = time.time()\n",
    "    transmat = translation_matrix.TranslationMatrix(source_word_vec, target_word_vec, sub_pair)\n",
    "    transmat.train(sub_pair)\n",
    "    endTime = time.time()\n",
    "    \n",
    "    sizeofword.append(len(sub_pair))\n",
    "    duration.append(endTime - startTime)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<script>requirejs.config({paths: { 'plotly': ['https://cdn.plot.ly/plotly-latest.min']},});if(!window.Plotly) {{require(['plotly'],function(plotly) {window.Plotly=plotly;});}}</script>"
      ],
      "text/vnd.plotly.v1+html": [
       "<script>requirejs.config({paths: { 'plotly': ['https://cdn.plot.ly/plotly-latest.min']},});if(!window.Plotly) {{require(['plotly'],function(plotly) {window.Plotly=plotly;});}}</script>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "application/vnd.plotly.v1+json": {
       "data": [
        {
         "type": "scatter",
         "x": [
          2094,
          4188,
          6282,
          8376,
          10470,
          12564,
          14658,
          16752,
          18846,
          20940
         ],
         "y": [
          0.5877759456634521,
          0.8401670455932617,
          0.9247369766235352,
          1.2453999519348145,
          1.60801100730896,
          1.892496109008789,
          2.141044855117798,
          2.1962528228759766,
          2.7086141109466553,
          3.2611770629882812
         ]
        }
       ],
       "layout": {
        "title": "time for creation"
       }
      },
      "text/html": [
       "<div id=\"53cb2abd-fa36-4248-8334-07bc8c053527\" style=\"height: 525px; width: 100%;\" class=\"plotly-graph-div\"></div><script type=\"text/javascript\">require([\"plotly\"], function(Plotly) { window.PLOTLYENV=window.PLOTLYENV || {};window.PLOTLYENV.BASE_URL=\"https://plot.ly\";Plotly.newPlot(\"53cb2abd-fa36-4248-8334-07bc8c053527\", [{\"y\": [0.5877759456634521, 0.8401670455932617, 0.9247369766235352, 1.2453999519348145, 1.60801100730896, 1.892496109008789, 2.141044855117798, 2.1962528228759766, 2.7086141109466553, 3.2611770629882812], \"x\": [2094, 4188, 6282, 8376, 10470, 12564, 14658, 16752, 18846, 20940], \"type\": \"scatter\"}], {\"title\": \"time for creation\"}, {\"linkText\": \"Export to plot.ly\", \"showLink\": true})});</script>"
      ],
      "text/vnd.plotly.v1+html": [
       "<div id=\"53cb2abd-fa36-4248-8334-07bc8c053527\" style=\"height: 525px; width: 100%;\" class=\"plotly-graph-div\"></div><script type=\"text/javascript\">require([\"plotly\"], function(Plotly) { window.PLOTLYENV=window.PLOTLYENV || {};window.PLOTLYENV.BASE_URL=\"https://plot.ly\";Plotly.newPlot(\"53cb2abd-fa36-4248-8334-07bc8c053527\", [{\"y\": [0.5877759456634521, 0.8401670455932617, 0.9247369766235352, 1.2453999519348145, 1.60801100730896, 1.892496109008789, 2.141044855117798, 2.1962528228759766, 2.7086141109466553, 3.2611770629882812], \"x\": [2094, 4188, 6282, 8376, 10470, 12564, 14658, 16752, 18846, 20940], \"type\": \"scatter\"}], {\"title\": \"time for creation\"}, {\"linkText\": \"Export to plot.ly\", \"showLink\": true})});</script>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "import plotly\n",
    "from plotly.graph_objs import Scatter, Layout\n",
    "\n",
    "plotly.offline.init_notebook_mode(connected=True)\n",
    "\n",
    "plotly.offline.iplot({\n",
    "    \"data\": [Scatter(x=sizeofword, y=duration)],\n",
    "    \"layout\": Layout(title=\"time for creation\"),\n",
    "}, filename=\"tm_creation_time.html\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "You will see a two dimensional coordination whose horizontal axis is the size of corpus and vertical axis is the time to train a translation matrix (the unit is second). As the size of corpus increases, the time increases linearly."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true,
    "deletable": true,
    "editable": true
   },
   "source": [
    "### Linear Relationship Between Languages"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "To have a better understanding of the principles behind, we visualized the word vectors using PCA, we noticed that the vector representations of similar words in different languages were related by a linear transformation."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<script>requirejs.config({paths: { 'plotly': ['https://cdn.plot.ly/plotly-latest.min']},});if(!window.Plotly) {{require(['plotly'],function(plotly) {window.Plotly=plotly;});}}</script>"
      ],
      "text/vnd.plotly.v1+html": [
       "<script>requirejs.config({paths: { 'plotly': ['https://cdn.plot.ly/plotly-latest.min']},});if(!window.Plotly) {{require(['plotly'],function(plotly) {window.Plotly=plotly;});}}</script>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "from sklearn.decomposition import PCA\n",
    "\n",
    "import plotly\n",
    "from plotly.graph_objs import Scatter, Layout, Figure\n",
    "plotly.offline.init_notebook_mode(connected=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true
   },
   "outputs": [],
   "source": [
    "words = [(\"one\", \"uno\"), (\"two\", \"due\"), (\"three\", \"tre\"), (\"four\", \"quattro\"), (\"five\", \"cinque\")]\n",
    "en_words_vec = [source_word_vec[item[0]] for item in words]\n",
    "it_words_vec = [target_word_vec[item[1]] for item in words]\n",
    "\n",
    "en_words, it_words = zip(*words)\n",
    "\n",
    "pca = PCA(n_components=2)\n",
    "new_en_words_vec = pca.fit_transform(en_words_vec)\n",
    "new_it_words_vec = pca.fit_transform(it_words_vec)\n",
    "\n",
    "# remove the code, use the plotly for ploting instead\n",
    "# fig = plt.figure()\n",
    "# fig.add_subplot(121)\n",
    "# plt.scatter(new_en_words_vec[:, 0], new_en_words_vec[:, 1])\n",
    "# for idx, item in enumerate(en_words):\n",
    "#     plt.annotate(item, xy=(new_en_words_vec[idx][0], new_en_words_vec[idx][1]))\n",
    "\n",
    "# fig.add_subplot(122)\n",
    "# plt.scatter(new_it_words_vec[:, 0], new_it_words_vec[:, 1])\n",
    "# for idx, item in enumerate(it_words):\n",
    "#     plt.annotate(item, xy=(new_it_words_vec[idx][0], new_it_words_vec[idx][1]))\n",
    "# plt.show()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true
   },
   "outputs": [
    {
     "data": {
      "application/vnd.plotly.v1+json": {
       "data": [
        {
         "mode": "markers+text",
         "text": [
          "one",
          "two",
          "three",
          "four",
          "five"
         ],
         "textposition": "top",
         "type": "scatter",
         "x": [
          1.3152927124584963,
          -0.2135391946410942,
          -0.41853131679090877,
          -0.44261965826408217,
          -0.24060254276241194
         ],
         "y": [
          -0.012982020585119488,
          0.6946434653689207,
          0.07943642028572512,
          -0.07689106511677166,
          -0.6842067999527552
         ]
        },
        {
         "mode": "markers+text",
         "text": [
          "uno",
          "due",
          "tre",
          "quattro",
          "cinque"
         ],
         "textposition": "top",
         "type": "scatter",
         "x": [
          2.027197813185206,
          -0.29054172798853567,
          -0.5369809554860695,
          -0.6418108143913539,
          -0.5578643153192474
         ],
         "y": [
          -0.09450346833782378,
          0.853948136638104,
          0.15055229620593427,
          -0.1533556973430418,
          -0.756641267163173
         ]
        }
       ],
       "layout": {
        "showlegend": false
       }
      },
      "text/html": [
       "<div id=\"705159a5-2ba2-46ae-bda1-db672d0bbe4b\" style=\"height: 525px; width: 100%;\" class=\"plotly-graph-div\"></div><script type=\"text/javascript\">require([\"plotly\"], function(Plotly) { window.PLOTLYENV=window.PLOTLYENV || {};window.PLOTLYENV.BASE_URL=\"https://plot.ly\";Plotly.newPlot(\"705159a5-2ba2-46ae-bda1-db672d0bbe4b\", [{\"textposition\": \"top\", \"text\": [\"one\", \"two\", \"three\", \"four\", \"five\"], \"mode\": \"markers+text\", \"y\": [-0.012982020585119488, 0.6946434653689207, 0.07943642028572512, -0.07689106511677166, -0.6842067999527552], \"x\": [1.3152927124584963, -0.2135391946410942, -0.41853131679090877, -0.44261965826408217, -0.24060254276241194], \"type\": \"scatter\"}, {\"textposition\": \"top\", \"text\": [\"uno\", \"due\", \"tre\", \"quattro\", \"cinque\"], \"mode\": \"markers+text\", \"y\": [-0.09450346833782378, 0.853948136638104, 0.15055229620593427, -0.1533556973430418, -0.756641267163173], \"x\": [2.027197813185206, -0.29054172798853567, -0.5369809554860695, -0.6418108143913539, -0.5578643153192474], \"type\": \"scatter\"}], {\"showlegend\": false}, {\"linkText\": \"Export to plot.ly\", \"showLink\": true})});</script>"
      ],
      "text/vnd.plotly.v1+html": [
       "<div id=\"705159a5-2ba2-46ae-bda1-db672d0bbe4b\" style=\"height: 525px; width: 100%;\" class=\"plotly-graph-div\"></div><script type=\"text/javascript\">require([\"plotly\"], function(Plotly) { window.PLOTLYENV=window.PLOTLYENV || {};window.PLOTLYENV.BASE_URL=\"https://plot.ly\";Plotly.newPlot(\"705159a5-2ba2-46ae-bda1-db672d0bbe4b\", [{\"textposition\": \"top\", \"text\": [\"one\", \"two\", \"three\", \"four\", \"five\"], \"mode\": \"markers+text\", \"y\": [-0.012982020585119488, 0.6946434653689207, 0.07943642028572512, -0.07689106511677166, -0.6842067999527552], \"x\": [1.3152927124584963, -0.2135391946410942, -0.41853131679090877, -0.44261965826408217, -0.24060254276241194], \"type\": \"scatter\"}, {\"textposition\": \"top\", \"text\": [\"uno\", \"due\", \"tre\", \"quattro\", \"cinque\"], \"mode\": \"markers+text\", \"y\": [-0.09450346833782378, 0.853948136638104, 0.15055229620593427, -0.1533556973430418, -0.756641267163173], \"x\": [2.027197813185206, -0.29054172798853567, -0.5369809554860695, -0.6418108143913539, -0.5578643153192474], \"type\": \"scatter\"}], {\"showlegend\": false}, {\"linkText\": \"Export to plot.ly\", \"showLink\": true})});</script>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "# you can also using plotly lib to plot in one figure\n",
    "trace1 = Scatter(\n",
    "    x = new_en_words_vec[:, 0],\n",
    "    y = new_en_words_vec[:, 1],\n",
    "    mode = 'markers+text',\n",
    "    text = en_words,\n",
    "    textposition = 'top'\n",
    ")\n",
    "trace2 = Scatter(\n",
    "    x = new_it_words_vec[:, 0],\n",
    "    y = new_it_words_vec[:, 1],\n",
    "    mode = 'markers+text',\n",
    "    text = it_words,\n",
    "    textposition = 'top'\n",
    ")\n",
    "layout = Layout(\n",
    "    showlegend = False\n",
    ")\n",
    "data = [trace1, trace2]\n",
    "\n",
    "fig = Figure(data=data, layout=layout)\n",
    "plot_url = plotly.offline.iplot(fig, filename='relatie_position_for_number.html')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "The figure shows that the word vectors for English number one to five and the corresponding Italian words uno to cinque have similar geometric arrangements. So the relationship between vector spaces that represent these two languages can be captured by linear mapping. \n",
    "If we know the translation of one to four from English to Italian, we can learn the transformation matrix that can help us to translate five or other numbers to the Italian word."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "translation of five:  OrderedDict([('five', [u'cinque', u'quattro', u'tre'])])\n"
     ]
    }
   ],
   "source": [
    "words = [(\"one\", \"uno\"), (\"two\", \"due\"), (\"three\", \"tre\"), (\"four\", \"quattro\"), (\"five\", \"cinque\")]\n",
    "en_words, it_words = zip(*words)\n",
    "en_words_vec = [source_word_vec[item[0]] for item in words]\n",
    "it_words_vec = [target_word_vec[item[1]] for item in words]\n",
    "\n",
    "# Translate the English word five to Italian word\n",
    "translated_word = transmat.translate([en_words[4]], 3)\n",
    "print \"translation of five: \", translated_word\n",
    "\n",
    "# the translated words of five\n",
    "for item in translated_word[en_words[4]]:\n",
    "    it_words_vec.append(target_word_vec[item])\n",
    "\n",
    "pca = PCA(n_components=2)\n",
    "new_en_words_vec = pca.fit_transform(en_words_vec)\n",
    "new_it_words_vec = pca.fit_transform(it_words_vec)\n",
    "\n",
    "# remove the code, use the plotly for ploting instead\n",
    "# fig = plt.figure()\n",
    "# fig.add_subplot(121)\n",
    "# plt.scatter(new_en_words_vec[:, 0], new_en_words_vec[:, 1])\n",
    "# for idx, item in enumerate(en_words):\n",
    "#     plt.annotate(item, xy=(new_en_words_vec[idx][0], new_en_words_vec[idx][1]))\n",
    "\n",
    "# fig.add_subplot(122)\n",
    "# plt.scatter(new_it_words_vec[:, 0], new_it_words_vec[:, 1])\n",
    "# for idx, item in enumerate(it_words):\n",
    "#     plt.annotate(item, xy=(new_it_words_vec[idx][0], new_it_words_vec[idx][1]))\n",
    "# # annote for the translation of five, the red text annotation is the translation of five\n",
    "# for idx, item in enumerate(translated_word[en_words[4]]):\n",
    "#     plt.annotate(item, xy=(new_it_words_vec[idx + 5][0], new_it_words_vec[idx + 5][1]),\n",
    "#                  xytext=(new_it_words_vec[idx + 5][0] + 0.1, new_it_words_vec[idx + 5][1] + 0.1),\n",
    "#                  color=\"red\",\n",
    "#                  arrowprops=dict(facecolor='red', shrink=0.1, width=1, headwidth=2),)\n",
    "# plt.show()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true
   },
   "outputs": [
    {
     "data": {
      "application/vnd.plotly.v1+json": {
       "data": [
        {
         "mode": "markers+text",
         "text": [
          "one",
          "two",
          "three",
          "four",
          "five"
         ],
         "textposition": "top",
         "type": "scatter",
         "x": [
          1.3152927124584963,
          -0.2135391946410942,
          -0.41853131679090877,
          -0.44261965826408217,
          -0.24060254276241194
         ],
         "y": [
          -0.012982020585119488,
          0.6946434653689207,
          0.07943642028572512,
          -0.07689106511677166,
          -0.6842067999527552
         ]
        },
        {
         "mode": "markers+text",
         "text": [
          "uno",
          "due",
          "tre",
          "quattro",
          "cinque"
         ],
         "textposition": "top",
         "type": "scatter",
         "x": [
          2.238256809135214,
          -0.001303773153150972,
          -0.3056676679567372,
          -0.43875310919367744,
          -0.374055740840617,
          -0.374055740840617,
          -0.43875310919367744,
          -0.3056676679567372
         ],
         "y": [
          -0.1257325191472076,
          0.7888307462022796,
          0.33178257213847573,
          0.08474468776781709,
          -0.7480763734338287,
          -0.7480763734338287,
          0.08474468776781709,
          0.33178257213847556
         ]
        }
       ],
       "layout": {
        "annotations": [
         {
          "arrowcolor": "black",
          "arrowhead": 0.5,
          "arrowsize": 1.5,
          "arrowwidth": 1,
          "text": "cinque",
          "x": -0.374055740840617,
          "y": -0.7480763734338287
         },
         {
          "arrowcolor": "black",
          "arrowhead": 0.5,
          "arrowsize": 1.5,
          "arrowwidth": 1,
          "text": "quattro",
          "x": -0.43875310919367744,
          "y": 0.08474468776781709
         },
         {
          "arrowcolor": "black",
          "arrowhead": 0.5,
          "arrowsize": 1.5,
          "arrowwidth": 1,
          "text": "tre",
          "x": -0.3056676679567372,
          "y": 0.33178257213847556
         }
        ],
        "showlegend": false
       }
      },
      "text/html": [
       "<div id=\"1f556b5c-ed59-407a-b715-8103f7fb9d69\" style=\"height: 525px; width: 100%;\" class=\"plotly-graph-div\"></div><script type=\"text/javascript\">require([\"plotly\"], function(Plotly) { window.PLOTLYENV=window.PLOTLYENV || {};window.PLOTLYENV.BASE_URL=\"https://plot.ly\";Plotly.newPlot(\"1f556b5c-ed59-407a-b715-8103f7fb9d69\", [{\"textposition\": \"top\", \"text\": [\"one\", \"two\", \"three\", \"four\", \"five\"], \"mode\": \"markers+text\", \"y\": [-0.012982020585119488, 0.6946434653689207, 0.07943642028572512, -0.07689106511677166, -0.6842067999527552], \"x\": [1.3152927124584963, -0.2135391946410942, -0.41853131679090877, -0.44261965826408217, -0.24060254276241194], \"type\": \"scatter\"}, {\"textposition\": \"top\", \"text\": [\"uno\", \"due\", \"tre\", \"quattro\", \"cinque\"], \"mode\": \"markers+text\", \"y\": [-0.1257325191472076, 0.7888307462022796, 0.33178257213847573, 0.08474468776781709, -0.7480763734338287, -0.7480763734338287, 0.08474468776781709, 0.33178257213847556], \"x\": [2.238256809135214, -0.001303773153150972, -0.3056676679567372, -0.43875310919367744, -0.374055740840617, -0.374055740840617, -0.43875310919367744, -0.3056676679567372], \"type\": \"scatter\"}], {\"showlegend\": false, \"annotations\": [{\"arrowhead\": 0.5, \"text\": \"cinque\", \"arrowsize\": 1.5, \"y\": -0.7480763734338287, \"x\": -0.374055740840617, \"arrowwidth\": 1, \"arrowcolor\": \"black\"}, {\"arrowhead\": 0.5, \"text\": \"quattro\", \"arrowsize\": 1.5, \"y\": 0.08474468776781709, \"x\": -0.43875310919367744, \"arrowwidth\": 1, \"arrowcolor\": \"black\"}, {\"arrowhead\": 0.5, \"text\": \"tre\", \"arrowsize\": 1.5, \"y\": 0.33178257213847556, \"x\": -0.3056676679567372, \"arrowwidth\": 1, \"arrowcolor\": \"black\"}]}, {\"linkText\": \"Export to plot.ly\", \"showLink\": true})});</script>"
      ],
      "text/vnd.plotly.v1+html": [
       "<div id=\"1f556b5c-ed59-407a-b715-8103f7fb9d69\" style=\"height: 525px; width: 100%;\" class=\"plotly-graph-div\"></div><script type=\"text/javascript\">require([\"plotly\"], function(Plotly) { window.PLOTLYENV=window.PLOTLYENV || {};window.PLOTLYENV.BASE_URL=\"https://plot.ly\";Plotly.newPlot(\"1f556b5c-ed59-407a-b715-8103f7fb9d69\", [{\"textposition\": \"top\", \"text\": [\"one\", \"two\", \"three\", \"four\", \"five\"], \"mode\": \"markers+text\", \"y\": [-0.012982020585119488, 0.6946434653689207, 0.07943642028572512, -0.07689106511677166, -0.6842067999527552], \"x\": [1.3152927124584963, -0.2135391946410942, -0.41853131679090877, -0.44261965826408217, -0.24060254276241194], \"type\": \"scatter\"}, {\"textposition\": \"top\", \"text\": [\"uno\", \"due\", \"tre\", \"quattro\", \"cinque\"], \"mode\": \"markers+text\", \"y\": [-0.1257325191472076, 0.7888307462022796, 0.33178257213847573, 0.08474468776781709, -0.7480763734338287, -0.7480763734338287, 0.08474468776781709, 0.33178257213847556], \"x\": [2.238256809135214, -0.001303773153150972, -0.3056676679567372, -0.43875310919367744, -0.374055740840617, -0.374055740840617, -0.43875310919367744, -0.3056676679567372], \"type\": \"scatter\"}], {\"showlegend\": false, \"annotations\": [{\"arrowhead\": 0.5, \"text\": \"cinque\", \"arrowsize\": 1.5, \"y\": -0.7480763734338287, \"x\": -0.374055740840617, \"arrowwidth\": 1, \"arrowcolor\": \"black\"}, {\"arrowhead\": 0.5, \"text\": \"quattro\", \"arrowsize\": 1.5, \"y\": 0.08474468776781709, \"x\": -0.43875310919367744, \"arrowwidth\": 1, \"arrowcolor\": \"black\"}, {\"arrowhead\": 0.5, \"text\": \"tre\", \"arrowsize\": 1.5, \"y\": 0.33178257213847556, \"x\": -0.3056676679567372, \"arrowwidth\": 1, \"arrowcolor\": \"black\"}]}, {\"linkText\": \"Export to plot.ly\", \"showLink\": true})});</script>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "trace1 = Scatter(\n",
    "    x = new_en_words_vec[:, 0],\n",
    "    y = new_en_words_vec[:, 1],\n",
    "    mode = 'markers+text',\n",
    "    text = en_words,\n",
    "    textposition = 'top'\n",
    ")\n",
    "trace2 = Scatter(\n",
    "    x = new_it_words_vec[:, 0],\n",
    "    y = new_it_words_vec[:, 1],\n",
    "    mode = 'markers+text',\n",
    "    text = it_words,\n",
    "    textposition = 'top'\n",
    ")\n",
    "layout = Layout(\n",
    "    showlegend = False,\n",
    "    annotations = [dict(\n",
    "        x = new_it_words_vec[5][0],\n",
    "        y = new_it_words_vec[5][1],\n",
    "        text = translated_word[en_words[4]][0],\n",
    "        arrowcolor = \"black\",\n",
    "        arrowsize = 1.5,\n",
    "        arrowwidth = 1,\n",
    "        arrowhead = 0.5\n",
    "      ), dict(\n",
    "        x = new_it_words_vec[6][0],\n",
    "        y = new_it_words_vec[6][1],\n",
    "        text = translated_word[en_words[4]][1],\n",
    "        arrowcolor = \"black\",\n",
    "        arrowsize = 1.5,\n",
    "        arrowwidth = 1,\n",
    "        arrowhead = 0.5\n",
    "      ), dict(\n",
    "        x = new_it_words_vec[7][0],\n",
    "        y = new_it_words_vec[7][1],\n",
    "        text = translated_word[en_words[4]][2],\n",
    "        arrowcolor = \"black\",\n",
    "        arrowsize = 1.5,\n",
    "        arrowwidth = 1,\n",
    "        arrowhead = 0.5\n",
    "      )]\n",
    ")\n",
    "data = [trace1, trace2]\n",
    "\n",
    "fig = Figure(data=data, layout=layout)\n",
    "plot_url = plotly.offline.iplot(fig, filename='relatie_position_for_numbers.html')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "You probably will see that two kind of different color nodes, one for the English and the other for the Italian. For the translation of word `five`, we return `top 3` similar words `[u'cinque', u'quattro', u'tre']`. We can easily see that the translation is convincing."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "Let's see some animal words, the figue shows that most of words are also share the similar geometric arrangements."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true
   },
   "outputs": [],
   "source": [
    "words = [(\"dog\", \"cane\"), (\"pig\", \"maiale\"), (\"cat\", \"gatto\"), (\"horse\", \"cavallo\"), (\"birds\", \"uccelli\")]\n",
    "en_words_vec = [source_word_vec[item[0]] for item in words]\n",
    "it_words_vec = [target_word_vec[item[1]] for item in words]\n",
    "\n",
    "en_words, it_words = zip(*words)\n",
    "\n",
    "# remove the code, use the plotly for ploting instead\n",
    "# pca = PCA(n_components=2)\n",
    "# new_en_words_vec = pca.fit_transform(en_words_vec)\n",
    "# new_it_words_vec = pca.fit_transform(it_words_vec)\n",
    "\n",
    "# fig = plt.figure()\n",
    "# fig.add_subplot(121)\n",
    "# plt.scatter(new_en_words_vec[:, 0], new_en_words_vec[:, 1])\n",
    "# for idx, item in enumerate(en_words):\n",
    "#     plt.annotate(item, xy=(new_en_words_vec[idx][0], new_en_words_vec[idx][1]))\n",
    "\n",
    "# fig.add_subplot(122)\n",
    "# plt.scatter(new_it_words_vec[:, 0], new_it_words_vec[:, 1])\n",
    "# for idx, item in enumerate(it_words):\n",
    "#     plt.annotate(item, xy=(new_it_words_vec[idx][0], new_it_words_vec[idx][1]))\n",
    "# plt.show()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true
   },
   "outputs": [
    {
     "data": {
      "application/vnd.plotly.v1+json": {
       "data": [
        {
         "mode": "markers+text",
         "text": [
          "dog",
          "pig",
          "cat",
          "horse",
          "birds"
         ],
         "textposition": "top",
         "type": "scatter",
         "x": [
          1.3152927124584963,
          -0.2135391946410942,
          -0.41853131679090877,
          -0.44261965826408217,
          -0.24060254276241194
         ],
         "y": [
          -0.012982020585119488,
          0.6946434653689207,
          0.07943642028572512,
          -0.07689106511677166,
          -0.6842067999527552
         ]
        },
        {
         "mode": "markers+text",
         "text": [
          "cane",
          "maiale",
          "gatto",
          "cavallo",
          "uccelli"
         ],
         "textposition": "top",
         "type": "scatter",
         "x": [
          2.238256809135214,
          -0.001303773153150972,
          -0.3056676679567372,
          -0.43875310919367744,
          -0.374055740840617,
          -0.374055740840617,
          -0.43875310919367744,
          -0.3056676679567372
         ],
         "y": [
          -0.1257325191472076,
          0.7888307462022796,
          0.33178257213847573,
          0.08474468776781709,
          -0.7480763734338287,
          -0.7480763734338287,
          0.08474468776781709,
          0.33178257213847556
         ]
        }
       ],
       "layout": {
        "showlegend": false
       }
      },
      "text/html": [
       "<div id=\"eb31703b-c65c-47cf-ae32-5ca51b1088dc\" style=\"height: 525px; width: 100%;\" class=\"plotly-graph-div\"></div><script type=\"text/javascript\">require([\"plotly\"], function(Plotly) { window.PLOTLYENV=window.PLOTLYENV || {};window.PLOTLYENV.BASE_URL=\"https://plot.ly\";Plotly.newPlot(\"eb31703b-c65c-47cf-ae32-5ca51b1088dc\", [{\"textposition\": \"top\", \"text\": [\"dog\", \"pig\", \"cat\", \"horse\", \"birds\"], \"mode\": \"markers+text\", \"y\": [-0.012982020585119488, 0.6946434653689207, 0.07943642028572512, -0.07689106511677166, -0.6842067999527552], \"x\": [1.3152927124584963, -0.2135391946410942, -0.41853131679090877, -0.44261965826408217, -0.24060254276241194], \"type\": \"scatter\"}, {\"textposition\": \"top\", \"text\": [\"cane\", \"maiale\", \"gatto\", \"cavallo\", \"uccelli\"], \"mode\": \"markers+text\", \"y\": [-0.1257325191472076, 0.7888307462022796, 0.33178257213847573, 0.08474468776781709, -0.7480763734338287, -0.7480763734338287, 0.08474468776781709, 0.33178257213847556], \"x\": [2.238256809135214, -0.001303773153150972, -0.3056676679567372, -0.43875310919367744, -0.374055740840617, -0.374055740840617, -0.43875310919367744, -0.3056676679567372], \"type\": \"scatter\"}], {\"showlegend\": false}, {\"linkText\": \"Export to plot.ly\", \"showLink\": true})});</script>"
      ],
      "text/vnd.plotly.v1+html": [
       "<div id=\"eb31703b-c65c-47cf-ae32-5ca51b1088dc\" style=\"height: 525px; width: 100%;\" class=\"plotly-graph-div\"></div><script type=\"text/javascript\">require([\"plotly\"], function(Plotly) { window.PLOTLYENV=window.PLOTLYENV || {};window.PLOTLYENV.BASE_URL=\"https://plot.ly\";Plotly.newPlot(\"eb31703b-c65c-47cf-ae32-5ca51b1088dc\", [{\"textposition\": \"top\", \"text\": [\"dog\", \"pig\", \"cat\", \"horse\", \"birds\"], \"mode\": \"markers+text\", \"y\": [-0.012982020585119488, 0.6946434653689207, 0.07943642028572512, -0.07689106511677166, -0.6842067999527552], \"x\": [1.3152927124584963, -0.2135391946410942, -0.41853131679090877, -0.44261965826408217, -0.24060254276241194], \"type\": \"scatter\"}, {\"textposition\": \"top\", \"text\": [\"cane\", \"maiale\", \"gatto\", \"cavallo\", \"uccelli\"], \"mode\": \"markers+text\", \"y\": [-0.1257325191472076, 0.7888307462022796, 0.33178257213847573, 0.08474468776781709, -0.7480763734338287, -0.7480763734338287, 0.08474468776781709, 0.33178257213847556], \"x\": [2.238256809135214, -0.001303773153150972, -0.3056676679567372, -0.43875310919367744, -0.374055740840617, -0.374055740840617, -0.43875310919367744, -0.3056676679567372], \"type\": \"scatter\"}], {\"showlegend\": false}, {\"linkText\": \"Export to plot.ly\", \"showLink\": true})});</script>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "trace1 = Scatter(\n",
    "    x = new_en_words_vec[:, 0],\n",
    "    y = new_en_words_vec[:, 1],\n",
    "    mode = 'markers+text',\n",
    "    text = en_words,\n",
    "    textposition = 'top'\n",
    ")\n",
    "trace2 = Scatter(\n",
    "    x = new_it_words_vec[:, 0],\n",
    "    y = new_it_words_vec[:, 1],\n",
    "    mode = 'markers+text',\n",
    "    text = it_words,\n",
    "    textposition ='top'\n",
    ")\n",
    "layout = Layout(\n",
    "    showlegend = False\n",
    ")\n",
    "data = [trace1, trace2]\n",
    "\n",
    "fig = Figure(data=data, layout=layout)\n",
    "plot_url = plotly.offline.iplot(fig, filename='relatie_position_for_animal.html')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "translation of birds:  OrderedDict([('birds', [u'uccelli', u'garzette', u'iguane'])])\n"
     ]
    }
   ],
   "source": [
    "words = [(\"dog\", \"cane\"), (\"pig\", \"maiale\"), (\"cat\", \"gatto\"), (\"horse\", \"cavallo\"), (\"birds\", \"uccelli\")]\n",
    "en_words, it_words = zip(*words)\n",
    "en_words_vec = [source_word_vec[item[0]] for item in words]\n",
    "it_words_vec = [target_word_vec[item[1]] for item in words]\n",
    "\n",
    "# Translate the English word birds to Italian word\n",
    "translated_word = transmat.translate([en_words[4]], 3)\n",
    "print \"translation of birds: \", translated_word\n",
    "\n",
    "# the translated words of birds\n",
    "for item in translated_word[en_words[4]]:\n",
    "    it_words_vec.append(target_word_vec[item])\n",
    "\n",
    "pca = PCA(n_components=2)\n",
    "new_en_words_vec = pca.fit_transform(en_words_vec)\n",
    "new_it_words_vec = pca.fit_transform(it_words_vec)\n",
    "\n",
    "# # remove the code, use the plotly for ploting instead\n",
    "# fig = plt.figure()\n",
    "# fig.add_subplot(121)\n",
    "# plt.scatter(new_en_words_vec[:, 0], new_en_words_vec[:, 1])\n",
    "# for idx, item in enumerate(en_words):\n",
    "#     plt.annotate(item, xy=(new_en_words_vec[idx][0], new_en_words_vec[idx][1]))\n",
    "\n",
    "# fig.add_subplot(122)\n",
    "# plt.scatter(new_it_words_vec[:, 0], new_it_words_vec[:, 1])\n",
    "# for idx, item in enumerate(it_words):\n",
    "#     plt.annotate(item, xy=(new_it_words_vec[idx][0], new_it_words_vec[idx][1]))\n",
    "# # annote for the translation of five, the red text annotation is the translation of five\n",
    "# for idx, item in enumerate(translated_word[en_words[4]]):\n",
    "#     plt.annotate(item, xy=(new_it_words_vec[idx + 5][0], new_it_words_vec[idx + 5][1]),\n",
    "#                  xytext=(new_it_words_vec[idx + 5][0] + 0.1, new_it_words_vec[idx + 5][1] + 0.1),\n",
    "#                  color=\"red\",\n",
    "#                  arrowprops=dict(facecolor='red', shrink=0.1, width=1, headwidth=2),)\n",
    "# plt.show()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true
   },
   "outputs": [
    {
     "data": {
      "application/vnd.plotly.v1+json": {
       "data": [
        {
         "mode": "markers+text",
         "text": [
          "dog",
          "pig",
          "cat",
          "horse",
          "birds"
         ],
         "textposition": "top",
         "type": "scatter",
         "x": [
          -0.7834409574097115,
          -0.24918978502602576,
          -0.3596040571376046,
          -1.1177974088927534,
          2.510032208466096
         ],
         "y": [
          -0.0023674114362893157,
          1.7359129383593168,
          0.5183138512392008,
          -1.728125363648011,
          -0.5237340145142162
         ]
        },
        {
         "mode": "markers+text",
         "text": [
          "cane",
          "maiale",
          "gatto",
          "cavallo",
          "uccelli"
         ],
         "textposition": "top",
         "type": "scatter",
         "x": [
          -1.2409943327592234,
          -1.7527728653051347,
          -1.2766374897944894,
          -1.06059013714337,
          2.3863134870161686
         ],
         "y": [
          -1.3573964620591614,
          2.405608050256626,
          -1.1248098791692556,
          -0.28287459942417575,
          0.16446671644015692
         ]
        }
       ],
       "layout": {
        "annotations": [
         {
          "arrowcolor": "black",
          "arrowhead": 0.5,
          "arrowsize": 1.5,
          "arrowwidth": 1,
          "text": "uccelli",
          "x": 2.3863134870161686,
          "y": 0.1644667164401571
         },
         {
          "arrowcolor": "black",
          "arrowhead": 0.5,
          "arrowsize": 1.5,
          "arrowwidth": 1,
          "text": "garzette",
          "x": 0.29281044463499584,
          "y": 0.10738109697864884
         },
         {
          "arrowcolor": "black",
          "arrowhead": 0.5,
          "arrowsize": 1.5,
          "arrowwidth": 1,
          "text": "iguane",
          "x": 0.2655574063348864,
          "y": -0.07684163946299687
         }
        ],
        "showlegend": false
       }
      },
      "text/html": [
       "<div id=\"0b89f234-d8a0-4273-8138-7e95ab0752b6\" style=\"height: 525px; width: 100%;\" class=\"plotly-graph-div\"></div><script type=\"text/javascript\">require([\"plotly\"], function(Plotly) { window.PLOTLYENV=window.PLOTLYENV || {};window.PLOTLYENV.BASE_URL=\"https://plot.ly\";Plotly.newPlot(\"0b89f234-d8a0-4273-8138-7e95ab0752b6\", [{\"textposition\": \"top\", \"text\": [\"dog\", \"pig\", \"cat\", \"horse\", \"birds\"], \"mode\": \"markers+text\", \"y\": [-0.0023674114362893157, 1.7359129383593168, 0.5183138512392008, -1.728125363648011, -0.5237340145142162], \"x\": [-0.7834409574097115, -0.24918978502602576, -0.3596040571376046, -1.1177974088927534, 2.510032208466096], \"type\": \"scatter\"}, {\"textposition\": \"top\", \"text\": [\"cane\", \"maiale\", \"gatto\", \"cavallo\", \"uccelli\"], \"mode\": \"markers+text\", \"y\": [-1.3573964620591614, 2.405608050256626, -1.1248098791692556, -0.28287459942417575, 0.16446671644015692], \"x\": [-1.2409943327592234, -1.7527728653051347, -1.2766374897944894, -1.06059013714337, 2.3863134870161686], \"type\": \"scatter\"}], {\"showlegend\": false, \"annotations\": [{\"arrowhead\": 0.5, \"text\": \"uccelli\", \"arrowsize\": 1.5, \"y\": 0.1644667164401571, \"x\": 2.3863134870161686, \"arrowwidth\": 1, \"arrowcolor\": \"black\"}, {\"arrowhead\": 0.5, \"text\": \"garzette\", \"arrowsize\": 1.5, \"y\": 0.10738109697864884, \"x\": 0.29281044463499584, \"arrowwidth\": 1, \"arrowcolor\": \"black\"}, {\"arrowhead\": 0.5, \"text\": \"iguane\", \"arrowsize\": 1.5, \"y\": -0.07684163946299687, \"x\": 0.2655574063348864, \"arrowwidth\": 1, \"arrowcolor\": \"black\"}]}, {\"linkText\": \"Export to plot.ly\", \"showLink\": true})});</script>"
      ],
      "text/vnd.plotly.v1+html": [
       "<div id=\"0b89f234-d8a0-4273-8138-7e95ab0752b6\" style=\"height: 525px; width: 100%;\" class=\"plotly-graph-div\"></div><script type=\"text/javascript\">require([\"plotly\"], function(Plotly) { window.PLOTLYENV=window.PLOTLYENV || {};window.PLOTLYENV.BASE_URL=\"https://plot.ly\";Plotly.newPlot(\"0b89f234-d8a0-4273-8138-7e95ab0752b6\", [{\"textposition\": \"top\", \"text\": [\"dog\", \"pig\", \"cat\", \"horse\", \"birds\"], \"mode\": \"markers+text\", \"y\": [-0.0023674114362893157, 1.7359129383593168, 0.5183138512392008, -1.728125363648011, -0.5237340145142162], \"x\": [-0.7834409574097115, -0.24918978502602576, -0.3596040571376046, -1.1177974088927534, 2.510032208466096], \"type\": \"scatter\"}, {\"textposition\": \"top\", \"text\": [\"cane\", \"maiale\", \"gatto\", \"cavallo\", \"uccelli\"], \"mode\": \"markers+text\", \"y\": [-1.3573964620591614, 2.405608050256626, -1.1248098791692556, -0.28287459942417575, 0.16446671644015692], \"x\": [-1.2409943327592234, -1.7527728653051347, -1.2766374897944894, -1.06059013714337, 2.3863134870161686], \"type\": \"scatter\"}], {\"showlegend\": false, \"annotations\": [{\"arrowhead\": 0.5, \"text\": \"uccelli\", \"arrowsize\": 1.5, \"y\": 0.1644667164401571, \"x\": 2.3863134870161686, \"arrowwidth\": 1, \"arrowcolor\": \"black\"}, {\"arrowhead\": 0.5, \"text\": \"garzette\", \"arrowsize\": 1.5, \"y\": 0.10738109697864884, \"x\": 0.29281044463499584, \"arrowwidth\": 1, \"arrowcolor\": \"black\"}, {\"arrowhead\": 0.5, \"text\": \"iguane\", \"arrowsize\": 1.5, \"y\": -0.07684163946299687, \"x\": 0.2655574063348864, \"arrowwidth\": 1, \"arrowcolor\": \"black\"}]}, {\"linkText\": \"Export to plot.ly\", \"showLink\": true})});</script>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "trace1 = Scatter(\n",
    "    x = new_en_words_vec[:, 0],\n",
    "    y = new_en_words_vec[:, 1],\n",
    "    mode = 'markers+text',\n",
    "    text = en_words,\n",
    "    textposition = 'top'\n",
    ")\n",
    "trace2 = Scatter(\n",
    "    x = new_it_words_vec[:5, 0],\n",
    "    y = new_it_words_vec[:5, 1],\n",
    "    mode = 'markers+text',\n",
    "    text = it_words[:5],\n",
    "    textposition = 'top'\n",
    ")\n",
    "layout = Layout(\n",
    "    showlegend = False,\n",
    "    annotations = [dict(\n",
    "        x = new_it_words_vec[5][0],\n",
    "        y = new_it_words_vec[5][1],\n",
    "        text = translated_word[en_words[4]][0],\n",
    "        arrowcolor = \"black\",\n",
    "        arrowsize = 1.5,\n",
    "        arrowwidth = 1,\n",
    "        arrowhead = 0.5\n",
    "      ), dict(\n",
    "        x = new_it_words_vec[6][0],\n",
    "        y = new_it_words_vec[6][1],\n",
    "        text = translated_word[en_words[4]][1],\n",
    "        arrowcolor = \"black\",\n",
    "        arrowsize = 1.5,\n",
    "        arrowwidth = 1,\n",
    "        arrowhead = 0.5\n",
    "      ), dict(\n",
    "        x = new_it_words_vec[7][0],\n",
    "        y = new_it_words_vec[7][1],\n",
    "        text = translated_word[en_words[4]][2],\n",
    "        arrowcolor = \"black\",\n",
    "        arrowsize = 1.5,\n",
    "        arrowwidth = 1,\n",
    "        arrowhead = 0.5\n",
    "      )]\n",
    ")\n",
    "data = [trace1, trace2]\n",
    "\n",
    "fig = Figure(data=data, layout=layout)\n",
    "plot_url = plotly.offline.iplot(fig, filename='relatie_position_for_animal.html')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "You probably will see that two kind of different color nodes, one for the English and the other for the Italian. For the translation of word `birds`, we return `top 3` similar words `[u'uccelli', u'garzette', u'iguane']`. We can easily see that the animals' words translation is also convincing as the numbers."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true,
    "deletable": true,
    "editable": true
   },
   "source": [
    "# Tranlation Matrix Revisit \n",
    "## Warning: this part is unstable/experimental, it requires more experimentation and will change soon!"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "As dicussion in this [PR](https://github.com/RaRe-Technologies/gensim/pull/1434), Translation Matrix not only can used to translate the words from one source language to another target lanuage, but also to translate new document vectors back to old model space.\n",
    "\n",
    "For example, if we have trained 15k documents using doc2vec (we called this as model1), and we are going to train new 35k documents using doc2vec(we called this as model2). So we can include those 15k documents as reference documents into the new 35k documents. Then we can get 15k document vectors from model1 and 50k document vectors from model2, but both of the two models have vectors for those 15k documents. We can use those vectors to build a mapping from model1 to model2. Finally, with this relation, we can back-mapping the model2's vector to model1. Therefore, 35k document vectors are learned using this method."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "In this notebook, we use the IMDB dataset as example. For more information about this dataset, please refer to [this](http://ai.stanford.edu/~amaas/data/sentiment/). And some of code are borrowed from this [notebook](http://localhost:8888/notebooks/docs/notebooks/doc2vec-IMDB.ipynb)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "100000 docs: 25000 train-sentiment, 25000 test-sentiment\n",
      "25000 25000 100000 15000 50000\n"
     ]
    }
   ],
   "source": [
    "import gensim\n",
    "from gensim.models.doc2vec import TaggedDocument\n",
    "from gensim.models import Doc2Vec\n",
    "from collections import namedtuple\n",
    "from gensim import utils\n",
    "\n",
    "def read_sentimentDocs():\n",
    "    SentimentDocument = namedtuple('SentimentDocument', 'words tags split sentiment')\n",
    "\n",
    "    alldocs = []  # will hold all docs in original order\n",
    "    with utils.smart_open('aclImdb/alldata-id.txt', encoding='utf-8') as alldata:\n",
    "        for line_no, line in enumerate(alldata):\n",
    "            tokens = gensim.utils.to_unicode(line).split()\n",
    "            words = tokens[1:]\n",
    "            tags = [line_no] # `tags = [tokens[0]]` would also work at extra memory cost\n",
    "            split = ['train','test','extra','extra'][line_no // 25000]  # 25k train, 25k test, 25k extra\n",
    "            sentiment = [1.0, 0.0, 1.0, 0.0, None, None, None, None][line_no // 12500] # [12.5K pos, 12.5K neg]*2 then unknown\n",
    "            alldocs.append(SentimentDocument(words, tags, split, sentiment))\n",
    "\n",
    "    train_docs = [doc for doc in alldocs if doc.split == 'train']\n",
    "    test_docs = [doc for doc in alldocs if doc.split == 'test']\n",
    "    doc_list = alldocs[:]  # for reshuffling per pass\n",
    "\n",
    "    print('%d docs: %d train-sentiment, %d test-sentiment' % (len(doc_list), len(train_docs), len(test_docs)))\n",
    "\n",
    "    return train_docs, test_docs, doc_list\n",
    "\n",
    "train_docs, test_docs, doc_list = read_sentimentDocs()\n",
    "\n",
    "small_corpus = train_docs[:15000]\n",
    "large_corpus = train_docs + test_docs\n",
    "\n",
    "print len(train_docs), len(test_docs), len(doc_list), len(small_corpus), len(large_corpus)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "Here, we train two Doc2vec model, the parameters can be determined by yourself. We trained on 15k documents for the `model1` and 50k documents for the `model2`. But you should mixed some documents which from the 15k document in `model` to the `model2` as dicussed before. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true,
    "deletable": true,
    "editable": true
   },
   "outputs": [],
   "source": [
    "# for the computer performance limited, didn't run on the notebook. \n",
    "# You do can trained on the server and save the model to the disk.\n",
    "import multiprocessing\n",
    "from random import shuffle\n",
    "\n",
    "cores = multiprocessing.cpu_count()\n",
    "model1 = Doc2Vec(dm=1, dm_concat=1, size=100, window=5, negative=5, hs=0, min_count=2, workers=cores)\n",
    "model2 = Doc2Vec(dm=1, dm_concat=1, size=100, window=5, negative=5, hs=0, min_count=2, workers=cores)\n",
    "\n",
    "small_train_docs = train_docs[:15000]\n",
    "# train for small corpus\n",
    "model1.build_vocab(small_train_docs)\n",
    "for epoch in xrange(50):\n",
    "    shuffle(small_train_docs)\n",
    "    model1.train(small_train_docs, total_examples=len(small_train_docs), epochs=1)\n",
    "model.save(\"small_doc_15000_iter50.bin\")\n",
    "\n",
    "large_train_docs = train_docs + test_docs\n",
    "# train for large corpus\n",
    "model2.build_vocab(large_train_docs)\n",
    "for epoch in xrange(50):\n",
    "    shuffle(large_train_docs)\n",
    "    model2.train(large_train_docs, total_examples=len(train_docs), epochs=1)\n",
    "# save the model\n",
    "model2.save(\"large_doc_50000_iter50.bin\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "For the IMDB training dataset, we train an classifier on the train data which has 25k documents with positive and negative label. Then using this classifier to predict the test data. To see what accuracy can the document vectors which learned by different method achieve."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "collapsed": true,
    "deletable": true,
    "editable": true
   },
   "outputs": [],
   "source": [
    "import os\n",
    "import numpy as np\n",
    "from sklearn.linear_model import LogisticRegression\n",
    "\n",
    "def test_classifier_error(train, train_label, test, test_label):\n",
    "    classifier = LogisticRegression()\n",
    "    classifier.fit(train, train_label)\n",
    "    score = classifier.score(test, test_label)\n",
    "    print \"the classifier score :\", score\n",
    "    return score"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "For the experiment one, we use the vector which learned by the Doc2vec method.To evalute those document vector, we use split those 50k document into two part, one for training and the other for testing."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "The vectors are learned by doc2vec method\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/usr/local/lib/python2.7/dist-packages/sklearn/utils/validation.py:526: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().\n",
      "  y = column_or_1d(y, warn=True)\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "the classifier score : 0.83372\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "0.83372000000000002"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "#you can change the data folder\n",
    "basedir = \"/home/robotcator/doc2vec\"\n",
    "\n",
    "model2 = Doc2Vec.load(os.path.join(basedir, \"large_doc_50000_iter50.bin\"))\n",
    "m2 = []\n",
    "for i in range(len(large_corpus)):\n",
    "    m2.append(model2.docvecs[large_corpus[i].tags])\n",
    "\n",
    "train_array = np.zeros((25000, 100))\n",
    "train_label = np.zeros((25000, 1))\n",
    "test_array = np.zeros((25000, 100))\n",
    "test_label = np.zeros((25000, 1))\n",
    "\n",
    "for i in range(12500):\n",
    "    train_array[i] = m2[i]\n",
    "    train_label[i] = 1\n",
    "\n",
    "    train_array[i + 12500] = m2[i + 12500]\n",
    "    train_label[i + 12500] = 0\n",
    "\n",
    "    test_array[i] = m2[i + 25000]\n",
    "    test_label[i] = 1\n",
    "\n",
    "    test_array[i + 12500] = m2[i + 37500]\n",
    "    test_label[i + 12500] = 0\n",
    "\n",
    "print \"The vectors are learned by doc2vec method\"\n",
    "test_classifier_error(train_array, train_label, test_array, test_label)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "For the experiment two, the document vectors are learned by the back-mapping method, which has a linear mapping for the `model1` and `model2`. Using this method like translation matrix for the word translation, If we provide the vector for the addtional 35k document vector in `model2`, we can infer this vector for the `model1`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "The vectors are learned by back-mapping method\n",
      "the classifier score : 0.79768\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "0.79767999999999994"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from gensim.models import translation_matrix\n",
    "# you can change the data folder\n",
    "basedir = \"/home/robotcator/doc2vec\"\n",
    "\n",
    "model1 = Doc2Vec.load(os.path.join(basedir, \"small_doc_15000_iter50.bin\"))\n",
    "model2 = Doc2Vec.load(os.path.join(basedir, \"large_doc_50000_iter50.bin\"))\n",
    "\n",
    "l = model1.docvecs.count\n",
    "l2 = model2.docvecs.count\n",
    "m1 = np.array([model1.docvecs[large_corpus[i].tags].flatten() for i in range(l)])\n",
    "\n",
    "# learn the mapping bettween two model\n",
    "model = translation_matrix.BackMappingTranslationMatrix(large_corpus[:15000], model1, model2)\n",
    "model.train(large_corpus[:15000])\n",
    "\n",
    "for i in range(l, l2):\n",
    "    infered_vec = model.infer_vector(model2.docvecs[large_corpus[i].tags])\n",
    "    m1 = np.vstack((m1, infered_vec.flatten()))\n",
    "\n",
    "train_array = np.zeros((25000, 100))\n",
    "train_label = np.zeros((25000, 1))\n",
    "test_array = np.zeros((25000, 100))\n",
    "test_label = np.zeros((25000, 1))\n",
    "\n",
    "# because those document, 25k documents are postive label, 25k documents are negative label\n",
    "for i in range(12500):\n",
    "    train_array[i] = m1[i]\n",
    "    train_label[i] = 1\n",
    "\n",
    "    train_array[i + 12500] = m1[i + 12500]\n",
    "    train_label[i + 12500] = 0\n",
    "\n",
    "    test_array[i] = m1[i + 25000]\n",
    "    test_label[i] = 1\n",
    "\n",
    "    test_array[i + 12500] = m1[i + 37500]\n",
    "    test_label[i + 12500] = 0\n",
    "\n",
    "print \"The vectors are learned by back-mapping method\"\n",
    "test_classifier_error(train_array, train_label, test_array, test_label)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "As we can see that, the vectors learned by back-mapping method performed not bad but still need improved."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "### Visulization\n",
    " we pick some documents and extract the vector both from `model1` and `model2`, we can see that they also share the similar geometric arrangment."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<script>requirejs.config({paths: { 'plotly': ['https://cdn.plot.ly/plotly-latest.min']},});if(!window.Plotly) {{require(['plotly'],function(plotly) {window.Plotly=plotly;});}}</script>"
      ],
      "text/vnd.plotly.v1+html": [
       "<script>requirejs.config({paths: { 'plotly': ['https://cdn.plot.ly/plotly-latest.min']},});if(!window.Plotly) {{require(['plotly'],function(plotly) {window.Plotly=plotly;});}}</script>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "from sklearn.decomposition import PCA\n",
    "\n",
    "import plotly\n",
    "from plotly.graph_objs import Scatter, Layout, Figure\n",
    "plotly.offline.init_notebook_mode(connected=True)\n",
    "\n",
    "m1_part = m1[14995: 15000]\n",
    "m2_part = m2[14995: 15000]\n",
    "\n",
    "m1_part = np.array(m1_part).reshape(len(m1_part), 100)\n",
    "m2_part = np.array(m2_part).reshape(len(m2_part), 100)\n",
    "\n",
    "pca = PCA(n_components=2)\n",
    "reduced_vec1 = pca.fit_transform(m1_part)\n",
    "reduced_vec2 = pca.fit_transform(m2_part)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true
   },
   "outputs": [
    {
     "data": {
      "application/vnd.plotly.v1+json": {
       "data": [
        {
         "mode": "markers+text",
         "text": [
          "doc0",
          "doc1",
          "doc2",
          "doc3",
          "doc4"
         ],
         "textposition": "top",
         "type": "scatter",
         "x": [
          0.07051670680398692,
          -1.0102077577398751,
          3.1857742321798215,
          -0.09972059557783772,
          -2.146362585666094
         ],
         "y": [
          -0.5189691969210101,
          -2.246528790821405,
          0.9159540904585602,
          -0.5770896605559841,
          2.426633557839838
         ]
        },
        {
         "mode": "markers+text",
         "text": [
          "doc0",
          "doc1",
          "doc2",
          "doc3",
          "doc4"
         ],
         "textposition": "top",
         "type": "scatter",
         "x": [
          -0.011053430964214062,
          -1.1904447145193158,
          3.5412222519049386,
          -0.9894757593888498,
          -1.3502483470325575
         ],
         "y": [
          0.012735965193926151,
          -2.356435297960164,
          0.13346781751478554,
          -0.8130964900528699,
          3.023328005304322
         ]
        }
       ],
       "layout": {
        "showlegend": false
       }
      },
      "text/html": [
       "<div id=\"d064b2bb-88a9-406f-a27c-c708e1301d5e\" style=\"height: 525px; width: 100%;\" class=\"plotly-graph-div\"></div><script type=\"text/javascript\">require([\"plotly\"], function(Plotly) { window.PLOTLYENV=window.PLOTLYENV || {};window.PLOTLYENV.BASE_URL=\"https://plot.ly\";Plotly.newPlot(\"d064b2bb-88a9-406f-a27c-c708e1301d5e\", [{\"textposition\": \"top\", \"text\": [\"doc0\", \"doc1\", \"doc2\", \"doc3\", \"doc4\"], \"mode\": \"markers+text\", \"y\": [-0.5189691969210101, -2.246528790821405, 0.9159540904585602, -0.5770896605559841, 2.426633557839838], \"x\": [0.07051670680398692, -1.0102077577398751, 3.1857742321798215, -0.09972059557783772, -2.146362585666094], \"type\": \"scatter\"}, {\"textposition\": \"top\", \"text\": [\"doc0\", \"doc1\", \"doc2\", \"doc3\", \"doc4\"], \"mode\": \"markers+text\", \"y\": [0.012735965193926151, -2.356435297960164, 0.13346781751478554, -0.8130964900528699, 3.023328005304322], \"x\": [-0.011053430964214062, -1.1904447145193158, 3.5412222519049386, -0.9894757593888498, -1.3502483470325575], \"type\": \"scatter\"}], {\"showlegend\": false}, {\"linkText\": \"Export to plot.ly\", \"showLink\": true})});</script>"
      ],
      "text/vnd.plotly.v1+html": [
       "<div id=\"d064b2bb-88a9-406f-a27c-c708e1301d5e\" style=\"height: 525px; width: 100%;\" class=\"plotly-graph-div\"></div><script type=\"text/javascript\">require([\"plotly\"], function(Plotly) { window.PLOTLYENV=window.PLOTLYENV || {};window.PLOTLYENV.BASE_URL=\"https://plot.ly\";Plotly.newPlot(\"d064b2bb-88a9-406f-a27c-c708e1301d5e\", [{\"textposition\": \"top\", \"text\": [\"doc0\", \"doc1\", \"doc2\", \"doc3\", \"doc4\"], \"mode\": \"markers+text\", \"y\": [-0.5189691969210101, -2.246528790821405, 0.9159540904585602, -0.5770896605559841, 2.426633557839838], \"x\": [0.07051670680398692, -1.0102077577398751, 3.1857742321798215, -0.09972059557783772, -2.146362585666094], \"type\": \"scatter\"}, {\"textposition\": \"top\", \"text\": [\"doc0\", \"doc1\", \"doc2\", \"doc3\", \"doc4\"], \"mode\": \"markers+text\", \"y\": [0.012735965193926151, -2.356435297960164, 0.13346781751478554, -0.8130964900528699, 3.023328005304322], \"x\": [-0.011053430964214062, -1.1904447145193158, 3.5412222519049386, -0.9894757593888498, -1.3502483470325575], \"type\": \"scatter\"}], {\"showlegend\": false}, {\"linkText\": \"Export to plot.ly\", \"showLink\": true})});</script>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "trace1 = Scatter(\n",
    "    x = reduced_vec1[:, 0],\n",
    "    y = reduced_vec1[:, 1],\n",
    "    mode = 'markers+text',\n",
    "    text = ['doc' + str(i) for i in range(len(reduced_vec1))],\n",
    "    textposition = 'top'\n",
    ")\n",
    "trace2 = Scatter(\n",
    "    x = reduced_vec2[:, 0],\n",
    "    y = reduced_vec2[:, 1],\n",
    "    mode = 'markers+text',\n",
    "    text = ['doc' + str(i) for i in range(len(reduced_vec1))],\n",
    "    textposition ='top'\n",
    ")\n",
    "layout = Layout(\n",
    "    showlegend = False\n",
    ")\n",
    "data = [trace1, trace2]\n",
    "\n",
    "fig = Figure(data=data, layout=layout)\n",
    "plot_url = plotly.offline.iplot(fig, filename='doc_vec_vis')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true
   },
   "outputs": [
    {
     "data": {
      "application/vnd.plotly.v1+json": {
       "data": [
        {
         "mode": "markers+text",
         "text": [
          "sdoc0",
          "sdoc1",
          "sdoc2",
          "sdoc3",
          "sdoc4",
          "sdoc5",
          "sdoc6"
         ],
         "textposition": "top",
         "type": "scatter",
         "x": [
          0.04158070321494938,
          -1.0480734663754014,
          3.002603402152576,
          -0.3126488313897585,
          -2.3445114105964766,
          0.18737105057985928,
          0.4736785524142543
         ],
         "y": [
          -0.48619949120991746,
          -2.218257278235836,
          1.1661253585758709,
          -0.13988024172066588,
          2.388154917904303,
          -0.298685545053014,
          -0.4112577202607385
         ]
        },
        {
         "mode": "markers+text",
         "text": [
          "tdoc0",
          "tdoc1",
          "tdoc2",
          "tdoc3",
          "tdoc4",
          "tdoc5",
          "tdoc6"
         ],
         "textposition": "top",
         "type": "scatter",
         "x": [
          0.06096140712493523,
          -1.680577246175968,
          3.393929811676195,
          -0.18136268486571286,
          -0.8359715950914588,
          -1.4302173299991963,
          0.6732376373312039
         ],
         "y": [
          0.1419609400305979,
          -1.532885470324882,
          -0.3535097325934502,
          1.3402699793233748,
          2.9078194104268373,
          -1.4140948773566353,
          -1.089560249505842
         ]
        }
       ],
       "layout": {
        "showlegend": false
       }
      },
      "text/html": [
       "<div id=\"bcd06f55-a61e-49bd-9741-1b81e7f56007\" style=\"height: 525px; width: 100%;\" class=\"plotly-graph-div\"></div><script type=\"text/javascript\">require([\"plotly\"], function(Plotly) { window.PLOTLYENV=window.PLOTLYENV || {};window.PLOTLYENV.BASE_URL=\"https://plot.ly\";Plotly.newPlot(\"bcd06f55-a61e-49bd-9741-1b81e7f56007\", [{\"textposition\": \"top\", \"text\": [\"sdoc0\", \"sdoc1\", \"sdoc2\", \"sdoc3\", \"sdoc4\", \"sdoc5\", \"sdoc6\"], \"mode\": \"markers+text\", \"y\": [-0.48619949120991746, -2.218257278235836, 1.1661253585758709, -0.13988024172066588, 2.388154917904303, -0.298685545053014, -0.4112577202607385], \"x\": [0.04158070321494938, -1.0480734663754014, 3.002603402152576, -0.3126488313897585, -2.3445114105964766, 0.18737105057985928, 0.4736785524142543], \"type\": \"scatter\"}, {\"textposition\": \"top\", \"text\": [\"tdoc0\", \"tdoc1\", \"tdoc2\", \"tdoc3\", \"tdoc4\", \"tdoc5\", \"tdoc6\"], \"mode\": \"markers+text\", \"y\": [0.1419609400305979, -1.532885470324882, -0.3535097325934502, 1.3402699793233748, 2.9078194104268373, -1.4140948773566353, -1.089560249505842], \"x\": [0.06096140712493523, -1.680577246175968, 3.393929811676195, -0.18136268486571286, -0.8359715950914588, -1.4302173299991963, 0.6732376373312039], \"type\": \"scatter\"}], {\"showlegend\": false}, {\"linkText\": \"Export to plot.ly\", \"showLink\": true})});</script>"
      ],
      "text/vnd.plotly.v1+html": [
       "<div id=\"bcd06f55-a61e-49bd-9741-1b81e7f56007\" style=\"height: 525px; width: 100%;\" class=\"plotly-graph-div\"></div><script type=\"text/javascript\">require([\"plotly\"], function(Plotly) { window.PLOTLYENV=window.PLOTLYENV || {};window.PLOTLYENV.BASE_URL=\"https://plot.ly\";Plotly.newPlot(\"bcd06f55-a61e-49bd-9741-1b81e7f56007\", [{\"textposition\": \"top\", \"text\": [\"sdoc0\", \"sdoc1\", \"sdoc2\", \"sdoc3\", \"sdoc4\", \"sdoc5\", \"sdoc6\"], \"mode\": \"markers+text\", \"y\": [-0.48619949120991746, -2.218257278235836, 1.1661253585758709, -0.13988024172066588, 2.388154917904303, -0.298685545053014, -0.4112577202607385], \"x\": [0.04158070321494938, -1.0480734663754014, 3.002603402152576, -0.3126488313897585, -2.3445114105964766, 0.18737105057985928, 0.4736785524142543], \"type\": \"scatter\"}, {\"textposition\": \"top\", \"text\": [\"tdoc0\", \"tdoc1\", \"tdoc2\", \"tdoc3\", \"tdoc4\", \"tdoc5\", \"tdoc6\"], \"mode\": \"markers+text\", \"y\": [0.1419609400305979, -1.532885470324882, -0.3535097325934502, 1.3402699793233748, 2.9078194104268373, -1.4140948773566353, -1.089560249505842], \"x\": [0.06096140712493523, -1.680577246175968, 3.393929811676195, -0.18136268486571286, -0.8359715950914588, -1.4302173299991963, 0.6732376373312039], \"type\": \"scatter\"}], {\"showlegend\": false}, {\"linkText\": \"Export to plot.ly\", \"showLink\": true})});</script>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "m1_part = m1[14995: 15002]\n",
    "m2_part = m2[14995: 15002]\n",
    "\n",
    "m1_part = np.array(m1_part).reshape(len(m1_part), 100)\n",
    "m2_part = np.array(m2_part).reshape(len(m2_part), 100)\n",
    "\n",
    "pca = PCA(n_components=2)\n",
    "reduced_vec1 = pca.fit_transform(m1_part)\n",
    "reduced_vec2 = pca.fit_transform(m2_part)\n",
    "\n",
    "trace1 = Scatter(\n",
    "    x = reduced_vec1[:, 0],\n",
    "    y = reduced_vec1[:, 1],\n",
    "    mode = 'markers+text',\n",
    "    text = ['sdoc' + str(i) for i in range(len(reduced_vec1))],\n",
    "    textposition = 'top'\n",
    ")\n",
    "trace2 = Scatter(\n",
    "    x = reduced_vec2[:, 0],\n",
    "    y = reduced_vec2[:, 1],\n",
    "    mode = 'markers+text',\n",
    "    text = ['tdoc' + str(i) for i in range(len(reduced_vec1))],\n",
    "    textposition ='top'\n",
    ")\n",
    "layout = Layout(\n",
    "    showlegend = False\n",
    ")\n",
    "data = [trace1, trace2]\n",
    "\n",
    "fig = Figure(data=data, layout=layout)\n",
    "plot_url = plotly.offline.iplot(fig, filename='doc_vec_vis')\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "You probably will see kinds of colors point. One for the `model1`, the `sdoc0` to `sdoc4` document vector are learned by Doc2vec and `sdoc5` and `sdoc6` are learned by back-mapping.  One for the `model2`, the `tdoc0` to `tdoc6` are learned by Doc2vec. We can see that some of points learned from the back-mapping method still have the relative position with the point learned by Doc2vec."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true,
    "deletable": true,
    "editable": true
   },
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 2",
   "language": "python",
   "name": "python2"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 2
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython2",
   "version": "2.7.6"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
}
